Fine-Tuning a Model with Hugging Face: The Sentiment Evolution
0. Setup and Environment
If using Google Colab, ensure you are on a GPU runtime: Runtime > Change runtime type > T4 GPU.
# Install the core Hugging Face libraries
!pip install -q transformers[torch] datasets evaluate accelerate peft

import torch
print(f"Is GPU available? {torch.cuda.is_available()}")
Is GPU available? True
This notebook walks through a practical transfer-learning workflow using the Hugging Face ecosystem.
We start with a general model, adapt it to movie-review sentiment, and compare performance before vs after fine-tuning.
Part A: The “Before” (The Raw Model)
We use transformers.pipeline with distilbert-base-uncased. This base checkpoint is a generalist (pretrained on large generic text corpora) and ships with no sentiment head: the classification head is newly initialized when the pipeline loads it, so its predictions are essentially arbitrary. Note the generic LABEL_1 output and near-0.5 score below.
from transformers import pipeline

base_model_checkpoint = "distilbert-base-uncased"
device = 0 if torch.cuda.is_available() else -1

base_classifier = pipeline(
    "sentiment-analysis",
    model=base_model_checkpoint,
    tokenizer=base_model_checkpoint,
    device=device,
)

before_review = "The cast is great and there are a few strong scenes, but the story drags for too long."
base_output = base_classifier(before_review)[0]
print("Movie review:", before_review)
print("Base model output:", base_output)
DistilBertForSequenceClassification LOAD REPORT from: distilbert-base-uncased
Key | Status |
------------------------+------------+-
vocab_transform.weight | UNEXPECTED |
vocab_layer_norm.weight | UNEXPECTED |
vocab_layer_norm.bias | UNEXPECTED |
vocab_projector.bias | UNEXPECTED |
vocab_transform.bias | UNEXPECTED |
classifier.weight | MISSING |
classifier.bias | MISSING |
pre_classifier.bias | MISSING |
pre_classifier.weight | MISSING |
Notes:
- UNEXPECTED :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING :those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
Movie review: The cast is great and there are a few strong scenes, but the story drags for too long.
Base model output: {'label': 'LABEL_1', 'score': 0.5121102929115295}
Part B: Data Engineering with datasets
Dataset Introduction: Rotten Tomatoes
The rotten_tomatoes dataset is a binary sentiment classification benchmark built from short movie-review snippets.
Each row has:
- text: the review sentence/snippet
- label: sentiment (0 = NEGATIVE, 1 = POSITIVE)
The dataset is already split into train, validation, and test, which makes it ideal for demonstrating a clean fine-tuning workflow.
We load the dataset and create a small training subset with .shuffle().select(range(1000)).
Why datasets is useful: it supports efficient memory mapping, which lets you work with datasets larger than RAM.
First, we explicitly download the dataset to the local Hugging Face cache with load_dataset_builder(...).download_and_prepare().
from datasets import load_dataset_builder

dataset_name = "rotten_tomatoes"
builder = load_dataset_builder(dataset_name)
builder.download_and_prepare()
print(f"Dataset downloaded/prepared at: {builder.cache_dir}")
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 8530
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 1066
})
test: Dataset({
features: ['text', 'label'],
num_rows: 1066
})
})
Train subset size: 1000
Eval subset size: 300
Example: {'text': '. . . plays like somebody spliced random moments of a chris rock routine into what is otherwise a cliche-riddled but self-serious spy thriller .', 'label': 0}
Part C: The Tokenizer (tokenizers)
We load AutoTokenizer from the base checkpoint and define a preprocessing function using padding="max_length" and truncation=True, then apply it with dataset.map(tokenize_function, batched=True). (Note: padding every example to max_length here leaves little for the dynamic collator in Part D to do; passing padding=False during map and letting the collator pad each batch would be faster.)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(base_model_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train_dataset = small_train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = small_eval_dataset.map(tokenize_function, batched=True)

sample_text = "Hello"
sample_ids = tokenizer(sample_text, add_special_tokens=True)["input_ids"]
print("Sample text:", sample_text)
print("Encoded IDs:", sample_ids)
print("Decoded:", tokenizer.decode(sample_ids))
print("You should see IDs similar to [101, 7592, 102] for this tokenizer.")
Sample text: Hello
Encoded IDs: [101, 7592, 102]
Decoded: [CLS] hello [SEP]
You should see IDs similar to [101, 7592, 102] for this tokenizer.
Side note
Fine-tuning is a form of transfer learning.
Intuition: lower layers capture broad language features, while higher layers and the task head specialize to your labeled task. In this tutorial, we freeze more general layers and train the classification head plus upper transformer layers, which is often data-efficient and stable for small datasets.
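As a toy illustration of that freezing idea in plain PyTorch (the tiny stacked model and its sizes are made up for this example, not DistilBERT): setting requires_grad=False on lower layers keeps them out of the optimizer's update.

```python
import torch.nn as nn

# Toy "encoder": lower layers learn broad features, the head is task-specific.
model = nn.Sequential(
    nn.Linear(16, 16),  # "lower" layer: broad, general features
    nn.ReLU(),
    nn.Linear(16, 16),  # "upper" layer: more task-adjacent
    nn.ReLU(),
    nn.Linear(16, 2),   # classification head
)

# Freeze the lower layer; only the upper layer and head receive gradients.
for param in model[0].parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable}/{total}")
```

The same pattern, applied to model.distilbert.embeddings and the lower transformer layers, appears in the real fine-tuning code in Part D.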
Part D: Fine-Tuning with the Trainer API
We load AutoModelForSequenceClassification with num_labels=2, configure TrainingArguments, and train with Trainer.
Fine-tuning techniques included in this tutorial:
- Layer freezing: freeze embeddings + lower transformer layers, train upper layers/head.
- Mixed precision: fp16=True when a GPU is available, for faster training.
- Frequent logging: logging_steps=10 so students see progress often.
- Gradient accumulation: simulate a larger effective batch size without extra VRAM.
- Warmup + scheduler: smoother optimization with warmup_ratio and cosine decay.
- Gradient clipping: control exploding updates with max_grad_norm.
- Early stopping: stop when validation performance stops improving.
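Two of these techniques, gradient accumulation and gradient clipping, are what the Trainer does for you under the hood; a toy PyTorch loop (shapes and hyperparameters here are illustrative, not the ones used below) makes the mechanics explicit:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 2  # optimizer steps once per `accum_steps` micro-batches
micro_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(4)]

initial_weight = model.weight.detach().clone()
opt.zero_grad()
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()  # scale so grads average over the window
    if step % accum_steps == 0:
        # Clip the accumulated gradient norm, then apply one optimizer step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        opt.step()
        opt.zero_grad()

print("Weight changed:", not torch.equal(initial_weight, model.weight))
```

With accum_steps=2 and a per-device batch of 8, the effective batch size is 16, matching how gradient_accumulation_steps=2 and per_device_train_batch_size=8 combine in the TrainingArguments below.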
import numpy as np
import evaluate
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
)

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    base_model_checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# Technique 1: Freeze the most general layers; fine-tune upper layers + classification head.
num_frozen_layers = 4
for param in model.distilbert.embeddings.parameters():
    param.requires_grad = False
for layer in model.distilbert.transformer.layer[:num_frozen_layers]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Technique 2: Dynamic padding for faster batches than static max_length padding.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="./distilbert-rotten-tomatoes",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
    fp16=torch.cuda.is_available(),
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,  # fraction of total steps used for warmup
    max_grad_norm=1.0,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_total_limit=2,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
(The same load report as in Part A appears here: the classifier head is newly initialized, which is exactly what we are about to train.)
There were missing keys in the checkpoint model loaded: ['distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias'].
There were unexpected keys in the checkpoint model loaded: ['distilbert.embeddings.LayerNorm.beta', 'distilbert.embeddings.LayerNorm.gamma'].
Optional: LoRA with peft
LoRA (Low-Rank Adaptation) updates a small set of adapter weights instead of all model weights, which reduces memory use and speeds up tuning. Use this when GPU memory is tight or when you want faster experimentation.
from peft import LoraConfig, TaskType, get_peft_model

lora_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],  # DistilBERT's attention query/value projections
)

lora_model = get_peft_model(lora_model, lora_config)
lora_model.print_trainable_parameters()

# If you want to train LoRA adapters instead of the full model,
# pass `lora_model` to Trainer(model=...) and keep the same training pipeline.
trainable params: 739,586 || all params: 67,694,596 || trainable%: 1.0925
Part E: Evaluation and Inference
Now we compute evaluation accuracy and compare predictions on the same movie review from Part A.
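The training and evaluation cell itself is not shown in this export; a sketch of what presumably runs here, reusing objects defined earlier in the notebook (trainer, tokenizer, device, before_review, base_output), so it is not standalone:

```python
# Fine-tune (Part D builds `trainer` but does not show this call).
trainer.train()

# Accuracy on the held-out subset.
eval_results = trainer.evaluate()
print("Eval accuracy:", eval_results["eval_accuracy"])

# Re-run the Part A review through the fine-tuned model.
finetuned_classifier = pipeline(
    "sentiment-analysis",
    model=trainer.model,
    tokenizer=tokenizer,
    device=device,
)
print("Movie review:", before_review)
print("Before fine-tuning:", base_output)
print("After fine-tuning:", finetuned_classifier(before_review)[0])
```

Because the fine-tuned model was loaded with id2label, it now reports human-readable NEGATIVE/POSITIVE labels instead of the generic LABEL_0/LABEL_1.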
Movie review: The cast is great and there are a few strong scenes, but the story drags for too long.
Before fine-tuning: {'label': 'LABEL_1', 'score': 0.5121102929115295}
After fine-tuning: {'label': 'NEGATIVE', 'score': 0.7407522201538086}